1 Background

We hope to explore the relative influence of physical traits, environmental conditions, and species identity on the growth rate of trees. A gradient boosted model (GBM) seems like a good candidate for this work.

1.1 Extracting Principal Components for Environmental Traits

First, we converted the environmental variables to principal components because they were highly correlated. We visualized the PCA and used the eigenvectors to identify which environmental condition best explained each PC. There were six: Soil Fertility, Light, Temperature, pH, Soil.Humidity.Depth, and Slope.
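As a minimal sketch of this step, the example below runs a PCA with base R's `prcomp()` and reads the loadings (eigenvectors) to see which variable dominates each component. The data are simulated and the column names (`soil_n`, `soil_p`, `canopy_openness`, `mean_temp`) are hypothetical stand-ins, not the variables used in this analysis.

```r
# Simulated stand-in for the environmental data; column names are hypothetical.
set.seed(1)
n <- 200
fertility <- rnorm(n)
env <- data.frame(
  soil_n = fertility + rnorm(n, sd = 0.3),   # two correlated soil variables
  soil_p = fertility + rnorm(n, sd = 0.3),
  canopy_openness = rnorm(n),
  mean_temp = rnorm(n)
)

# Centre and scale so variables measured in different units contribute equally.
pca <- prcomp(env, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Loadings (eigenvectors): the variable with the largest absolute loading
# is a first guess at what each PC represents.
loadings <- pca$rotation
top_variable <- apply(abs(loadings), 2, function(v) rownames(loadings)[which.max(v)])

print(round(var_explained, 3))
print(top_variable)
```

The component scores in `pca$x` are what would then be passed to the model in place of the original correlated variables.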

1.1.1 PC1-PC2

1.1.2 PC3-PC4

1.1.3 PC5-PC6

1.1.4 Correlation on Plant Traits

We want to ensure that the plant traits are not strongly correlated. Past work suggests that they are not easily represented using a PCA, so we will not use this feature reduction method for the traits.
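One way to run such a check is to compute a correlation matrix and list the pairs above a cutoff. The sketch below uses simulated data with hypothetical trait names; the Spearman method and the 0.7 threshold are assumptions chosen for illustration, not values taken from this analysis.

```r
# Simulated traits; names are hypothetical. trait_d is built to track trait_a.
set.seed(2)
trait_a <- rnorm(100)
traits <- data.frame(
  trait_a = trait_a,
  trait_b = rnorm(100),
  trait_c = rnorm(100),
  trait_d = trait_a + rnorm(100, sd = 0.2)
)

# Rank-based correlation is less sensitive to skewed trait distributions.
cor_mat <- cor(traits, method = "spearman", use = "pairwise.complete.obs")

# Report each pair once (upper triangle) when |rho| exceeds the threshold.
threshold <- 0.7
idx <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  rho = cor_mat[idx],
  stringsAsFactors = FALSE
)
print(high_pairs)
```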

1.2 About Gradient Boosted Models

A gradient boosted machine/model is a machine learning model that uses decision trees to fit the data.

A decision tree first starts with all of the observations, then, from the variables provided, it tries to figure out which variable split would result in the “purest” groupings of the data. So, in this case, it would try to place rows with higher growth rates in one node, and those with lower growth rates in another node.
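The split search described above can be sketched for a single numeric predictor as follows. This is an illustration on toy data, using the within-node sum of squared errors as the purity measure (a common choice for regression trees); the function name `best_split` is made up for this example.

```r
# Find the threshold on one predictor that minimises the total within-node
# sum of squared errors (SSE) of the response.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # Candidate thresholds are midpoints between consecutive unique values.
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- vapply(candidates, function(th) {
    left <- y[x <= th]
    right <- y[x > th]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  list(threshold = candidates[which.min(sse)], sse = min(sse))
}

# Toy data: growth is low when x <= 5 and high when x > 5.
x <- 1:10
y <- c(1.0, 1.2, 0.9, 1.1, 1.0, 3.0, 3.2, 2.9, 3.1, 3.0)
result <- best_split(x, y)
print(result$threshold)
```

A full tree repeats this search over every predictor at every node and keeps the predictor and threshold with the lowest SSE.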

GBMs are an ensemble of decision trees, but the trees are fit sequentially. We call GBMs an ensemble of weak learners because each subsequent tree attempts to correct the errors of the trees before it. Thus, while one tree by itself cannot describe the relationships, all of the trees together can. Below is a figure by Bradley Boehmke that illustrates how each subsequent tree improves the fit to the data.

Figure: Boosted regression decision stumps as 0-1024 successive trees are added.
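To make the sequential fitting concrete, here is a small from-scratch sketch of boosting with decision stumps under squared-error loss. It is not the `gbm` or h2o implementation; the helper names (`fit_stump`, `predict_stump`, `boost`) and the simulated sine data are assumptions for illustration.

```r
# Fit a one-split tree (stump) to y by minimising within-node SSE.
fit_stump <- function(x, y) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- vapply(candidates, function(th) {
    left <- y[x <= th]
    right <- y[x > th]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  th <- candidates[which.min(sse)]
  list(threshold = th, left = mean(y[x <= th]), right = mean(y[x > th]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$threshold, stump$left, stump$right)
}

# Each new stump is fit to the residuals of the current ensemble and added
# with a small weight (the learning rate).
boost <- function(x, y, n_trees, learn_rate) {
  pred <- rep(mean(y), length(y))
  mse <- numeric(n_trees)
  for (m in seq_len(n_trees)) {
    residual <- y - pred
    stump <- fit_stump(x, residual)
    pred <- pred + learn_rate * predict_stump(stump, x)
    mse[m] <- mean((y - pred)^2)
  }
  list(pred = pred, mse = mse)
}

set.seed(3)
x <- seq(0, 2 * pi, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.1)
fit <- boost(x, y, n_trees = 100, learn_rate = 0.1)
print(round(fit$mse[c(1, 10, 100)], 4))
```

The training error never increases from one tree to the next here, which mirrors the progressively better fit shown in the figure. Training error alone says nothing about overfitting, which is why validation data or cross-validation is used to choose the number of trees.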

2 Compare Models

We compared the fit of three gradient boosted models to determine how environmental gradients and physical traits influence RGR:

2.1 Model 1: Tree Age + Plant Trait + Environmental Conditions

2.1.1 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_57"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 11
## 
## $min_rows
## [1] 1
## 
## $nbins
## [1] 16
## 
## $nbins_cats
## [1] 256
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3395.824
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.49
## 
## $col_sample_rate
## [1] 0.42
## 
## $col_sample_rate_per_tree
## [1] 0.34
## 
## $min_split_improvement
## [1] 1e-08
## 
## $histogram_type
## [1] "UniformAdaptive"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
##  [1] "Soil.Fertility"     "Light"              "Temperature"        "pH"                 "Slope"             
##  [6] "Estem"              "Branching.Distance" "Stem.Wood.Density"  "Leaf.Area"          "LMA"               
## [11] "LCC"                "LNC"                "LPC"                "d15N"               "t.b2"              
## [16] "Ks"                 "Ktwig"              "Huber.Value"        "X.Lum"              "VD"                
## [21] "X.Sapwood"          "d13C"               "Tree.Age"           "julian.date.2011"  
## 
## $y
## [1] "BAI_GR"

2.1.2 Build Model

Now, we can build the model.

library(gbm)    # gbm()
library(dplyr)  # %>%, filter(), select(), any_of()

set.seed(123)
gbm_regressor_bai_residuals <-
  gbm(BAI_GR ~ .,
      data = 
        rgr_msh_na %>% filter(Group == "Train") %>% filter(!is.na(BAI_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Tree.Age", "BAI_GR", "julian.date.2011"))),
      n.trees = 1000,          # maximum number of trees
      interaction.depth = 11,  # max depth
      shrinkage = 0.05,        # learning rate
      n.minobsinnode = 10,     # minimum observations per terminal node (not col_sample_rate)
      bag.fraction = 0.49,     # sample_rate (fraction of rows sampled per tree)
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)            # 5-fold cross-validation

2.1.3 Relative Importance

First, we look at the importance of variables in the model.

2.1.4 Partial Dependence

Next, we assess the relationship between growth rate and each predictor when all other predictors are held constant.
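The idea behind a partial dependence curve can be sketched in a few lines: fix one predictor at a grid value for every row, predict, and average. The example below uses a linear model on simulated data as a stand-in for the fitted GBM; the helper `partial_dependence` and the variables `light`, `temperature`, and `growth` are hypothetical.

```r
# Average prediction when `variable` is set to each grid value in turn,
# leaving every other column at its observed values.
partial_dependence <- function(model, data, variable, grid) {
  vapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value
    mean(predict(model, newdata = modified))
  }, numeric(1))
}

# Simulated data and a simple stand-in model (not the GBM from this report).
set.seed(4)
dat <- data.frame(light = runif(200), temperature = runif(200))
dat$growth <- 2 * dat$light + dat$temperature^2 + rnorm(200, sd = 0.05)
model <- lm(growth ~ light + I(temperature^2), data = dat)

grid <- seq(0, 1, by = 0.25)
pd_light <- partial_dependence(model, dat, "light", grid)
print(round(pd_light, 3))
```

Because the stand-in model is additive in `light`, the curve is a straight line with slope equal to the fitted coefficient. For a tree-based model the same procedure yields a step-shaped curve.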

Soil Fertility

Light

Temperature

pH

Slope

Modulus of Elasticity for Stem

Branching Distance

Stem Wood Density

Leaf Area

Leaf Mass Per Area

Leaf Carbon Concentration

Leaf Nitrogen Concentration

Leaf Phosphorus Concentration

Delta 15N

Thickness to Span Ratio

Conductivity Per Sapwood Area

Conductivity per Branch

Huber Value

Percent Lumen

Vessel Diameter

Percent Sapwood

Delta 13C

Tree Age

Julian Date in 2011

2.1.5 Performance

How does the model perform when we use the true individual trait value?
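The report does not state which metrics were used, so the sketch below shows three common choices for a continuous response (RMSE, MAE, and R-squared) on made-up numbers. The function name `regression_metrics` is an assumption for illustration.

```r
# Common regression metrics computed from observed and predicted values.
regression_metrics <- function(observed, predicted) {
  residual <- observed - predicted
  c(
    rmse = sqrt(mean(residual^2)),
    mae = mean(abs(residual)),
    r_squared = 1 - sum(residual^2) / sum((observed - mean(observed))^2)
  )
}

# Made-up values, only to show the calculation.
observed <- c(1, 2, 3, 4)
predicted <- c(1.5, 1.5, 3.5, 3.5)
metrics <- regression_metrics(observed, predicted)
print(metrics)
```

For an honest estimate, `observed` and `predicted` should come from rows held out of training, such as the validation or test group.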

2.1.6 Interactions | Table

Let’s explore the interactions in these data.

2.1.7 Interaction | Group | Test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value by Class
## Kruskal-Wallis chi-squared = 21.524, df = 3, p-value = 8.194e-05
##                                                                                  Comparison          Z      P.unadj       P.adj
## 1 Environmental Conditions:Environmental Conditions - Environmental Conditions:Plant Traits  0.5730733 5.665950e-01 0.566595039
## 2 Environmental Conditions:Environmental Conditions - Plant Traits:Environmental Conditions -2.9581361 3.095055e-03 0.015475273
## 3             Environmental Conditions:Plant Traits - Plant Traits:Environmental Conditions -1.7512647 7.990033e-02 0.239700984
## 4             Environmental Conditions:Environmental Conditions - Plant Traits:Plant Traits -4.2940685 1.754283e-05 0.000105257
## 5                         Environmental Conditions:Plant Traits - Plant Traits:Plant Traits -2.2793383 2.264696e-02 0.090587842
## 6                         Plant Traits:Environmental Conditions - Plant Traits:Plant Traits -1.4936179 1.352755e-01 0.270551058
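The adjusted p-values in the pairwise table are consistent with a Holm correction of the unadjusted values. The short check below applies base R's `p.adjust()` to the six reported `P.unadj` values and compares the result with the reported `P.adj` column. Only the numbers printed above are used, and the 0.05 cutoff is an assumption for illustration.

```r
# Unadjusted and adjusted p-values copied from the pairwise comparison table.
p_unadj <- c(5.665950e-01, 3.095055e-03, 7.990033e-02,
             1.754283e-05, 2.264696e-02, 1.352755e-01)
p_adj_reported <- c(0.566595039, 0.015475273, 0.239700984,
                    0.000105257, 0.090587842, 0.270551058)

# Holm step-down adjustment for six comparisons.
p_holm <- p.adjust(p_unadj, method = "holm")

# Largest discrepancy between recomputed and reported values
# (limited only by the rounding of the printed output).
max_diff <- max(abs(p_holm - p_adj_reported))

# Comparisons that remain below 0.05 after adjustment.
significant <- which(p_holm < 0.05)

print(signif(p_holm, 6))
print(significant)
```

After adjustment, only comparisons 2 and 4 stay below 0.05. Both involve the Environmental Conditions:Environmental Conditions class.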

2.1.8 Interaction | Group | Violin

2.1.9 Interaction | Group | Boxplot

2.1.10 Interaction | Group | Density

2.1.11 Interaction | Group | Top

2.1.12 Interactions | Plots

Now, we plot interactions with values > 0.10.

Leaf Phosphorus Concentration:Huber Value

Leaf Phosphorus Concentration:Percent Lumen

Percent Lumen:Tree Age

Leaf Phosphorus Concentration:Tree Age

Leaf Area:Thickness to Span Ratio

Light:Branching Distance

Soil Fertility:Huber Value

Slope:Vessel Diameter

Leaf Area:Huber Value

Modulus of Elasticity for Stem:Delta 13C

Branching Distance:Leaf Nitrogen Concentration

Delta 15N:Huber Value

Leaf Phosphorus Concentration:Vessel Diameter

pH:Conductivity per Branch

Slope:Modulus of Elasticity for Stem

Soil Fertility:Temperature

Thickness to Span Ratio:Vessel Diameter

Light:Conductivity per Branch

Leaf Carbon Concentration:Tree Age

Soil Fertility:Branching Distance

Slope:Tree Age

Branching Distance:Huber Value

Delta 15N:Delta 13C

Percent Sapwood:Tree Age

Thickness to Span Ratio:Percent Sapwood

Temperature:Branching Distance

Light:Tree Age

Modulus of Elasticity for Stem:Delta 15N

Branching Distance:Delta 15N

Leaf Mass Per Area:Modulus of Elasticity for Stem

Leaf Nitrogen Concentration:Percent Lumen

Modulus of Elasticity for Stem:Huber Value

Slope:Huber Value

Branching Distance:Julian Date in 2011

Branching Distance:Percent Lumen

Huber Value:Percent Lumen

Vessel Diameter:Percent Sapwood

Leaf Phosphorus Concentration:Conductivity Per Sapwood Area

Thickness to Span Ratio:Huber Value

Thickness to Span Ratio:Tree Age

Modulus of Elasticity for Stem:Branching Distance

pH:Leaf Nitrogen Concentration

Leaf Nitrogen Concentration:Thickness to Span Ratio

Branching Distance:Leaf Carbon Concentration

Temperature:Delta 13C

Modulus of Elasticity for Stem:Conductivity per Branch

Huber Value:Tree Age

Huber Value:Delta 13C

Branching Distance:Tree Age

2.1.13 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
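A group-level comparison of this kind can be computed by summing the per-variable relative influence within each group. In the sketch below the variable names come from the predictor list above, but the influence values and the data frame layout are made up for illustration and are not results from this model.

```r
# Hypothetical per-variable relative influence (values are invented).
importance <- data.frame(
  variable = c("Tree.Age", "Light", "Slope", "LMA", "Huber.Value"),
  rel_inf = c(30, 15, 5, 20, 30),
  stringsAsFactors = FALSE
)

# Lookup from variable name to group.
groups <- c(
  Tree.Age = "Tree Age",
  Light = "Environmental Conditions",
  Slope = "Environmental Conditions",
  LMA = "Plant Traits",
  Huber.Value = "Plant Traits"
)
importance$group <- unname(groups[importance$variable])

# Sum influence within each group and express it as a percentage of the total.
group_importance <- aggregate(rel_inf ~ group, data = importance, FUN = sum)
group_importance$percent <- 100 * group_importance$rel_inf / sum(group_importance$rel_inf)
group_importance <- group_importance[order(-group_importance$percent), ]
print(group_importance)
```

Groups with more variables can accumulate more influence simply by having more members, so the per-variable mean within each group is a useful companion summary.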

2.2 Model 2: Tree Age + Species Identity + Environmental Conditions

2.2.1 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_64"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 9
## 
## $min_rows
## [1] 4
## 
## $nbins
## [1] 1024
## 
## $nbins_cats
## [1] 32
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3488.306
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.74
## 
## $col_sample_rate
## [1] 0.94
## 
## $col_sample_rate_per_tree
## [1] 0.99
## 
## $min_split_improvement
## [1] 0
## 
## $histogram_type
## [1] "RoundRobin"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
## [1] "Soil.Fertility"   "Light"            "Temperature"      "pH"               "Slope"            "Species"         
## [7] "Tree.Age"         "julian.date.2011"
## 
## $y
## [1] "BAI_GR"

2.2.2 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_bai_residuals_species <-
  gbm(BAI_GR ~ .,
      data = 
        rgr_msh_na %>% filter(Group == "Train") %>% filter(!is.na(BAI_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep, "Species", "Tree.Age", "BAI_GR", "julian.date.2011"))) %>%
        mutate(Species = factor(Species)),
      n.trees = 1000,         # maximum number of trees
      interaction.depth = 9,  # max depth
      shrinkage = 0.05,       # learning rate
      n.minobsinnode = 7,     # minimum observations per terminal node (not col_sample_rate)
      bag.fraction = 0.74,    # sample_rate (fraction of rows sampled per tree)
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)           # 5-fold cross-validation

2.2.3 Relative Importance

First, we look at the importance of variables in the model.

2.2.4 Partial Dependence

Next, we assess the relationship between growth rate and each predictor when all other predictors are held constant.

Soil Fertility

Light

Temperature

pH

Slope

Species

Tree Age

Julian Date in 2011

2.2.5 Performance

How does the model perform?

2.2.6 Interactions | Table

Let’s explore the interactions in these data.

2.2.7 Interaction | Group | Test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value by Class
## Kruskal-Wallis chi-squared = 3.6714, df = 3, p-value = 0.2992

2.2.8 Interaction | Group | Violin

2.2.9 Interaction | Group | Boxplot

2.2.10 Interaction | Group | Density

2.2.11 Interactions | Plots

Now, we plot interactions with values > 0.10.

Species:Tree Age

Slope:Species

pH:Species

Temperature:Species

Species:Julian Date in 2011

Light:Species

Soil Fertility:pH

Soil Fertility:Species

Light:Tree Age

Soil Fertility:Light

Temperature:pH

Soil Fertility:Julian Date in 2011

Tree Age:Julian Date in 2011

pH:Slope

Temperature:Slope

Soil Fertility:Slope

Temperature:Julian Date in 2011

Light:Temperature

Slope:Tree Age

Soil Fertility:Temperature

2.2.12 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, species identity, and environmental conditions.

2.3 Model 3: Tree Age + Species Identity + Plant Trait + Environmental Conditions

2.3.1 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_61"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 9
## 
## $min_rows
## [1] 4
## 
## $nbins
## [1] 16
## 
## $nbins_cats
## [1] 64
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3510.497
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.49
## 
## $col_sample_rate
## [1] 0.86
## 
## $col_sample_rate_per_tree
## [1] 0.44
## 
## $min_split_improvement
## [1] 1e-06
## 
## $histogram_type
## [1] "QuantilesGlobal"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
##  [1] "Soil.Fertility"     "Light"              "Temperature"        "pH"                 "Slope"             
##  [6] "Estem"              "Branching.Distance" "Stem.Wood.Density"  "Leaf.Area"          "LMA"               
## [11] "LCC"                "LNC"                "LPC"                "d15N"               "t.b2"              
## [16] "Ks"                 "Ktwig"              "Huber.Value"        "X.Lum"              "VD"                
## [21] "X.Sapwood"          "d13C"               "Species"            "Tree.Age"           "julian.date.2011"  
## 
## $y
## [1] "BAI_GR"

2.3.2 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_baiSpeciesAgeEP <-
  gbm(BAI_GR ~ .,
      data = 
        rgr_msh_na %>% filter(Group == "Train") %>% filter(!is.na(BAI_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Species",
                        "Tree.Age", "BAI_GR", "julian.date.2011"))) %>%
        mutate(Species = factor(Species)),
      n.trees = 1000,         # maximum number of trees
      interaction.depth = 9,  # max depth
      shrinkage = 0.05,       # learning rate
      n.minobsinnode = 21,    # minimum observations per terminal node (not col_sample_rate)
      bag.fraction = 0.49,    # sample_rate (fraction of rows sampled per tree)
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)           # 5-fold cross-validation

2.3.3 Relative Importance

First, we look at the importance of variables in the model.

2.3.4 Partial Dependence

Next, we assess the relationship between growth rate and each predictor when all other predictors are held constant.

Soil Fertility

Light

Temperature

pH

Slope

Modulus of Elasticity for Stem

Branching Distance

Stem Wood Density

Leaf Area

Leaf Mass Per Area

Leaf Carbon Concentration

Leaf Nitrogen Concentration

Leaf Phosphorus Concentration

Delta 15N

Thickness to Span Ratio

Conductivity Per Sapwood Area

Conductivity per Branch

Huber Value

Percent Lumen

Vessel Diameter

Percent Sapwood

Delta 13C

Species

Tree Age

Julian Date in 2011

2.3.5 Performance

How does the model perform?

2.3.6 Interactions | Table

Let’s explore the interactions in these data.

2.3.7 Interaction | Group | Test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value by Class
## Kruskal-Wallis chi-squared = 32.442, df = 3, p-value = 4.224e-07

2.3.8 Interaction | Group | Violin

2.3.9 Interaction | Group | Boxplot

2.3.10 Interaction | Group | Density

2.3.11 Interactions | Plots

Now, we plot interactions with values > 0.10.

Slope:Vessel Diameter

Light:Branching Distance

Temperature:Branching Distance

Branching Distance:Leaf Phosphorus Concentration

Light:Delta 13C

Temperature:Leaf Phosphorus Concentration

2.3.12 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, species identity, plant traits, and environmental conditions.